4  Re-weighting to a target population

Author

Janick Weberpals, RPh, PhD

Published

September 13, 2024

This is a reproducible example on how to incorporate population weights to match distributions of a target population in multiple imputation > matching/weighting > balance assessment > outcome analysis workflows.

Load packages:

Show the code
library(here)
library(dplyr)
library(survival)
library(mice)
library(MatchThem)
library(MatchIt)
library(survey)
library(gtsummary)

# calling functions
source(here::here("functions", "source_encore.io_functions.R"))

# track time
runtime <- tictoc::tic()

4.1 Data generation

We use the simulate_flaura() function to simulate a realistic oncology comparative effectiveness cohort analytic dataset.

Show the code
# load example dataset with missing observations
data_miss <- simulate_flaura(
  n_total = 3500, 
  seed = 41, 
  include_id = FALSE, 
  imposeNA = TRUE
  ) |> 
  # anesrake works best with factor variables
  # create age category with age less than 65
  mutate(dem_age_lt65 = factor(ifelse(dem_age_index_cont < 65, "<65", "65+"))) |> 
  # convert dem_race into a binary Asian vs. non-Asian 
  mutate(dem_race = factor(ifelse(dem_race == "Asian", "Asian", "Non-Asian"))) |>
  # convert dem_sex_cont into a factor 
  mutate(dem_sex_cont = factor(ifelse(dem_sex_cont == "1", "Male", "Female"))) |> 
  # convert dem_sex_cont into a factor 
  mutate(c_smoking_history = factor(ifelse(c_smoking_history == TRUE, "Current/former", "Never"))) |> 
  # convert c_ecog_cont into a factor 
  mutate(across(c(c_ecog_cont), function(x) factor(as.character(x))))

4.2 Multiple imputation

Multiple imputation using mice:

Show the code
# impute data
data_imp <- futuremice(
  parallelseed = 42,
  n.core = 7,
  data = data_miss,
  method = "rf",
  m = 10,
  print = FALSE
  )

4.3 Defining target distributions

Before applying the re-weighting, we need to define the target distributions of patient characteristics that we want to match from the clinical trial using the raking procedure. The following distributions are taken from Table 1 of the FLAURA trial.

Table 4.1: FLAURA trial Table 1; in OS analysis race was simplified to Asian vs. non-Asian
Show the code
# Define FLAURA distributions for key covariates --------------------------
# order is as in Table 1

## age (taken from https://www.tagrissohcp.com/metastatic/flaura/efficacy.html)
# less than 65 years (54%, TRUE) to 65+ (46%, FALSE)
age_target <- c(.54, .46)
names(age_target) <- c("<65", "65+")

## sex ---------------------------------------------------------------------

# female (0) to male (1) proportion:
sex_target <- c(.63, .37) 
names(sex_target) <- c("Female", "Male")

## race --------------------------------------------------------------------
# asian, non-asian
# asian (TRUE) to non-asian (FALSE) proportion
# note: logical variables in dataframe can be matched to a numeric vector of length 2 and ordered with the TRUE target as the first element and the FALSE target as the second element.
race_target <- c(.62, .38)
names(race_target) <- c("Asian", "Non-Asian")

## smoking -----------------------------------------------------------------

# current/former smoker (TRUE) to never smoker (FALSE) proportion
# note: logical variables in dataframe can be matched to a numeric vector of length 2 and ordered with the TRUE target as the first element and the FALSE target as the second element.
smoker_target <- c(.35, .65)
names(smoker_target) <- c("Current/former", "Never")

## ecog --------------------------------------------------------------------

# ecog 0 to ecog 1 proportion
ecog_target <- c(.41, .59)
names(ecog_target) <- c("0", "1")

# summarize target distributions in a named list vector --------------
targets <- list(age_target, sex_target, race_target, smoker_target, ecog_target)
names(targets) <- c("dem_age_lt65", "dem_sex_cont", "dem_race", "c_smoking_history", "c_ecog_cont")

# print
targets
$dem_age_lt65
 <65  65+ 
0.54 0.46 

$dem_sex_cont
Female   Male 
  0.63   0.37 

$dem_race
    Asian Non-Asian 
     0.62      0.38 

$c_smoking_history
Current/former          Never 
          0.35           0.65 

$c_ecog_cont
   0    1 
0.41 0.59 

4.4 Propensity score matching and re-weighting

In this step, propensity score matching and re-weighting of key patient characteristics to match those of the original RCT is performed across all imputed datasets.

The propensity score model is specified as follows:

Show the code
# apply propensity score matching on mids object
ps_form <- as.formula(paste("treat ~", paste(covariates_for_ps, collapse = " + ")))
ps_form
treat ~ dem_age_index_cont + dem_sex_cont + c_smoking_history + 
    c_number_met_sites + c_hemoglobin_g_dl_cont + c_urea_nitrogen_mg_dl_cont + 
    c_platelets_10_9_l_cont + c_calcium_mg_dl_cont + c_glucose_mg_dl_cont + 
    c_lymphocyte_leukocyte_ratio_cont + c_alp_u_l_cont + c_protein_g_l_cont + 
    c_alt_u_l_cont + c_albumin_g_l_cont + c_bilirubin_mg_dl_cont + 
    c_chloride_mmol_l_cont + c_monocytes_10_9_l_cont + c_eosinophils_leukocytes_ratio_cont + 
    c_ldh_u_l_cont + c_hr_cont + c_sbp_cont + c_oxygen_cont + 
    c_ecog_cont + c_neutrophil_lymphocyte_ratio_cont + c_bmi_cont + 
    c_ast_alt_ratio_cont + c_stage_initial_dx_cont + dem_race + 
    dem_region + dem_ses + c_time_dx_to_index

The matching and re-weighting is performed using the re_weight() function. This function is a wrapper for matchit() and weightit() in combination with the anesrake() function which performs the raking (= re-weighting) procedure.

We apply this function to each imputed dataset. Before doing so, the imputed datasets, which are currently stored as a mids object, needs to be converted to a list of dataframes:

Show the code
# create a mild object containing lists of data.frames
data_mild <- mice::complete(data = data_imp, action = "all", include = FALSE)

summary(data_mild)
   Length Class      Mode
1  40     data.frame list
2  40     data.frame list
3  40     data.frame list
4  40     data.frame list
5  40     data.frame list
6  40     data.frame list
7  40     data.frame list
8  40     data.frame list
9  40     data.frame list
10 40     data.frame list

The lapply function loops the function through each dataframe and returns a list of matchit objects which contain imputed > matched > re-weighted datasets. To take advantage of the features that come with the cobalt and matchthem packages, the function stores the raking weights as sampling weights (s.weights).

Show the code
# call match re-weight
matchit_out_list <- lapply(
  # list of dataframes
  X = data_mild, 
  # call function
  FUN = re_weight,
  # target distributions
  targets = targets,
  # should matching or weighting be performed
  matching_weighting = "matching",
  # matching arguments passed on to matchit() function
  formula = ps_form,
  ratio = 1,
  method = "nearest",
  distance = "glm",
  link = "logit",
  caliper = 0.05,
  replace = F
  )
[1] "Raking converged in 10 iterations"
[1] "Raking converged in 12 iterations"
[1] "Raking converged in 12 iterations"
[1] "Raking converged in 11 iterations"
[1] "Raking converged in 12 iterations"
[1] "Raking converged in 10 iterations"
[1] "Raking converged in 15 iterations"
[1] "Raking converged in 10 iterations"
[1] "Raking converged in 10 iterations"
[1] "Raking converged in 10 iterations"
[1] "Raking converged in 12 iterations"
[1] "Raking converged in 10 iterations"
[1] "Raking converged in 10 iterations"
[1] "Raking converged in 10 iterations"

We can inspect the output of the first imputed > matched > re-weighted dataset.

Show the code
matchit_out_list[[1]]
A matchit object
 - method: 1:1 nearest neighbor matching without replacement
 - distance: Propensity score [caliper]
             - estimated with logistic regression
             - sampling weights not included in estimation
 - caliper: <distance> (0.003)
 - number of obs.: 3500 (original), 2874 (matched)
 - sampling weights: present
 - target estimand: ATT
 - covariates: dem_age_index_cont, dem_sex_cont, c_smoking_history, c_number_met_sites, c_hemoglobin_g_dl_cont, c_urea_nitrogen_mg_dl_cont, c_platelets_10_9_l_cont, c_calcium_mg_dl_cont, c_glucose_mg_dl_cont, c_lymphocyte_leukocyte_ratio_cont, c_alp_u_l_cont, c_protein_g_l_cont, c_alt_u_l_cont, c_albumin_g_l_cont, c_bilirubin_mg_dl_cont, c_chloride_mmol_l_cont, c_monocytes_10_9_l_cont, c_eosinophils_leukocytes_ratio_cont, c_ldh_u_l_cont, c_hr_cont, c_sbp_cont, c_oxygen_cont, c_ecog_cont, c_neutrophil_lymphocyte_ratio_cont, c_bmi_cont, c_ast_alt_ratio_cont, c_stage_initial_dx_cont, dem_race, dem_region, dem_ses, c_time_dx_to_index

4.5 Table 1

To check if the re-weighting process worked, we can extract the matched patients and compare a Table 1 that does not include the weights vs. a Table that considers the weights. For this example, we look at the first imputed > matched > re-weighted dataset.

Show the code
# extract the matched of
first_dataset <- get_matches(
  object = matchit_out_list[[1]]
  )

Reminder : The target distributions look like this

Show the code
targets
$dem_age_lt65
 <65  65+ 
0.54 0.46 

$dem_sex_cont
Female   Male 
  0.63   0.37 

$dem_race
    Asian Non-Asian 
     0.62      0.38 

$c_smoking_history
Current/former          Never 
          0.35           0.65 

$c_ecog_cont
   0    1 
0.41 0.59 
Show the code
library(cardx)
library(smd)

# print
first_dataset |>
  tbl_summary(
    by = treat,
    include = c(dem_age_index_cont, names(targets))
    ) |> 
  add_difference(test = dplyr::everything() ~ "smd") |>
  add_overall() |>
  modify_column_hide(columns = "conf.low") |> 
  modify_header(
    label ~ "**Patient characteristic**",
    stat_0 ~ "**Total** <br> N = {round(N, 2)}",
    stat_1 ~ "**{level}** <br> N = {round(n, 2)} <br> ({style_percent(p, digits=1)}%)",
    stat_2 ~ "**{level}** <br> N = {round(n, 2)} <br> ({style_percent(p, digits=1)}%)"
    ) |>
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Treatment received**")
Table 4.2: Table 1 BEFORE re-weighting

Patient characteristic

Total
N = 2874

1

Treatment received

Difference

2

0
N = 1437
(50.0%)

1

1
N = 1437
(50.0%)

1
dem_age_index_cont 69 (64, 74) 69 (64, 74) 69 (64, 75) -0.01
dem_age_lt65


0.00
    <65 895 (31%) 449 (31%) 446 (31%)
    65+ 1,979 (69%) 988 (69%) 991 (69%)
dem_sex_cont


0.00
    Female 1,881 (65%) 939 (65%) 942 (66%)
    Male 993 (35%) 498 (35%) 495 (34%)
dem_race


0.01
    Asian 948 (33%) 478 (33%) 470 (33%)
    Non-Asian 1,926 (67%) 959 (67%) 967 (67%)
c_smoking_history


0.02
    Current/former 1,255 (44%) 622 (43%) 633 (44%)
    Never 1,619 (56%) 815 (57%) 804 (56%)
c_ecog_cont


0.01
    0 1,331 (46%) 669 (47%) 662 (46%)
    1 1,543 (54%) 768 (53%) 775 (54%)
1

Median (Q1, Q3); n (%)

2

Standardized Mean Difference

Show the code
# create survey object 
data_svy <- svydesign(ids = ~ 1, weights = ~ weights, data = first_dataset)

# print
data_svy |>
  tbl_svysummary(
    by = treat,
    include = c(dem_age_index_cont, names(targets))
    ) |> 
  add_difference(test = dplyr::everything() ~ "smd") |>
  add_overall() |>
  modify_column_hide(columns = "conf.low") |> 
  modify_header(
    label ~ "**Patient characteristic**",
    stat_0 ~ "**Total** <br> N = {round(N, 2)}",
    stat_1 ~ "**{level}** <br> N = {round(n, 2)} <br> ({style_percent(p, digits=1)}%)",
    stat_2 ~ "**{level}** <br> N = {round(n, 2)} <br> ({style_percent(p, digits=1)}%)"
    ) |>
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Treatment received**")
Table 4.3: Table 1 AFTER re-weighting

Patient characteristic

Total
N = 2874

1

Treatment received

Difference

2

0
N = 1443.54
(50.2%)

1

1
N = 1430.46
(49.8%)

1
dem_age_index_cont 64 (60, 72) 64 (60, 72) 65 (61, 72) -0.03
dem_age_lt65


0.00
    <65 1,552 (54%) 781 (54%) 771 (54%)
    65+ 1,322 (46%) 663 (46%) 659 (46%)
dem_sex_cont


0.02
    Female 1,811 (63%) 917 (64%) 893 (62%)
    Male 1,063 (37%) 526 (36%) 537 (38%)
dem_race


0.01
    Asian 1,782 (62%) 899 (62%) 882 (62%)
    Non-Asian 1,092 (38%) 544 (38%) 548 (38%)
c_smoking_history


0.01
    Current/former 1,006 (35%) 503 (35%) 503 (35%)
    Never 1,868 (65%) 941 (65%) 927 (65%)
c_ecog_cont


0.02
    0 1,178 (41%) 598 (41%) 580 (41%)
    1 1,696 (59%) 845 (59%) 850 (59%)
1

Median (Q1, Q3); n (%)

2

Standardized Mean Difference

4.5.1 Comparison of custom function vs. matchthem()

Lastly, we want to make sure that our custom function results in the same matched datasets as the matchthem() function which we use in the main analysis - not considering the re-weighting.

For this demonstration, we use the same matching parameters, but without re-weighting after matching in our custom function.

4.5.2 Custom function

We run again our custom function but with targets = NULL to not re-weight any of the included covariates. To convert the returned output of a list of matchit objects into an object of type mimids we use the MatchThem::as.mimids() function.

Show the code
# call match re-weight
set.seed(42)
matchit_out_list <- lapply(
  X = data_mild, 
  FUN = re_weight,
  targets = NULL,
  matching_weighting = "matching",
  formula = ps_form,
  ratio = 1,
  method = "nearest",
  distance = "glm",
  link = "logit",
  caliper = 0.05,
  replace = F
  )
No target distributions specified, no re-weighting will be performed.
No target distributions specified, no re-weighting will be performed.
No target distributions specified, no re-weighting will be performed.
No target distributions specified, no re-weighting will be performed.
No target distributions specified, no re-weighting will be performed.
No target distributions specified, no re-weighting will be performed.
No target distributions specified, no re-weighting will be performed.
No target distributions specified, no re-weighting will be performed.
No target distributions specified, no re-weighting will be performed.
No target distributions specified, no re-weighting will be performed.
Show the code
# convert the output into a mimids object
data_mimids_from_function <- MatchThem::as.mimids(
  x = matchit_out_list, 
  datasets = data_imp
  )

data_mimids_from_function
Printing               | dataset: #1
A matchit object
 - method: 1:1 nearest neighbor matching without replacement
 - distance: Propensity score [caliper]
             - estimated with logistic regression
 - caliper: <distance> (0.003)
 - number of obs.: 3500 (original), 2874 (matched)
 - target estimand: ATT
 - covariates: dem_age_index_cont, dem_sex_cont, c_smoking_history, c_number_met_sites, c_hemoglobin_g_dl_cont, c_urea_nitrogen_mg_dl_cont, c_platelets_10_9_l_cont, c_calcium_mg_dl_cont, c_glucose_mg_dl_cont, c_lymphocyte_leukocyte_ratio_cont, c_alp_u_l_cont, c_protein_g_l_cont, c_alt_u_l_cont, c_albumin_g_l_cont, c_bilirubin_mg_dl_cont, c_chloride_mmol_l_cont, c_monocytes_10_9_l_cont, c_eosinophils_leukocytes_ratio_cont, c_ldh_u_l_cont, c_hr_cont, c_sbp_cont, c_oxygen_cont, c_ecog_cont, c_neutrophil_lymphocyte_ratio_cont, c_bmi_cont, c_ast_alt_ratio_cont, c_stage_initial_dx_cont, dem_race, dem_region, dem_ses, c_time_dx_to_index

4.5.3 matchthem() function

The following code resembles the code we would use in the main analysis by implementing the generic matchthem() function.

Show the code
# matching
set.seed(42)
data_mimids <- matchthem(
  datasets = data_imp,
  formula = ps_form,
  ratio = 1,
  method = "nearest",
  distance = "glm",
  link = "logit",
  caliper = 0.05,
  replace = F
  )

Matching Observations  | dataset: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10
Show the code
data_mimids
Printing               | dataset: #1
A matchit object
 - method: 1:1 nearest neighbor matching without replacement
 - distance: Propensity score [caliper]
             - estimated with logistic regression
 - caliper: <distance> (0.003)
 - number of obs.: 3500 (original), 2874 (matched)
 - target estimand: ATT
 - covariates: dem_age_index_cont, dem_sex_cont, c_smoking_history, c_number_met_sites, c_hemoglobin_g_dl_cont, c_urea_nitrogen_mg_dl_cont, c_platelets_10_9_l_cont, c_calcium_mg_dl_cont, c_glucose_mg_dl_cont, c_lymphocyte_leukocyte_ratio_cont, c_alp_u_l_cont, c_protein_g_l_cont, c_alt_u_l_cont, c_albumin_g_l_cont, c_bilirubin_mg_dl_cont, c_chloride_mmol_l_cont, c_monocytes_10_9_l_cont, c_eosinophils_leukocytes_ratio_cont, c_ldh_u_l_cont, c_hr_cont, c_sbp_cont, c_oxygen_cont, c_ecog_cont, c_neutrophil_lymphocyte_ratio_cont, c_bmi_cont, c_ast_alt_ratio_cont, c_stage_initial_dx_cont, dem_race, dem_region, dem_ses, c_time_dx_to_index

4.5.4 Comparison of stacked datasets

We can now stack the datasets (= vertically append them) and compare the resulting 10 x 10 datasets for any differences:

Show the code
waldo::compare(
  MatchThem::complete(data_mimids_from_function), 
  MatchThem::complete(data_mimids)
  )
✔ No differences

4.6 Session info

Script runtime: 1.21 minutes.

Show the code
pander::pander(subset(data.frame(sessioninfo::package_info()), attached==TRUE, c(package, loadedversion)))
  package loadedversion
cardx cardx 0.2.0
dplyr dplyr 1.1.4
gtsummary gtsummary 2.0.1
here here 1.0.1
MatchIt MatchIt 4.5.5
MatchThem MatchThem 1.2.1
Matrix Matrix 1.7-0
mice mice 3.16.0
smd smd 0.7.0
survey survey 4.4-2
survival survival 3.5-8
Show the code
pander::pander(sessionInfo())

R version 4.4.0 (2024-04-24)

Platform: x86_64-pc-linux-gnu

locale: LC_CTYPE=C.UTF-8, LC_NUMERIC=C, LC_TIME=C.UTF-8, LC_COLLATE=C.UTF-8, LC_MONETARY=C.UTF-8, LC_MESSAGES=C.UTF-8, LC_PAPER=C.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=C.UTF-8 and LC_IDENTIFICATION=C

attached base packages: grid, stats, graphics, grDevices, datasets, utils, methods and base

other attached packages: smd(v.0.7.0), cardx(v.0.2.0), gtsummary(v.2.0.1), survey(v.4.4-2), Matrix(v.1.7-0), MatchIt(v.4.5.5), MatchThem(v.1.2.1), mice(v.3.16.0), survival(v.3.5-8), dplyr(v.1.1.4) and here(v.1.0.1)

loaded via a namespace (and not attached): tidyselect(v.1.2.1), fastmap(v.1.2.0), digest(v.0.6.37), rpart(v.4.1.23), lifecycle(v.1.0.4), cluster(v.2.1.6), waldo(v.0.5.3), gdata(v.3.0.0), magrittr(v.2.0.3), compiler(v.4.4.0), sass(v.0.4.9), rlang(v.1.1.4), Hmisc(v.5.1-3), tools(v.4.4.0), gt(v.0.11.0), utf8(v.1.2.4), yaml(v.2.3.10), data.table(v.1.16.0), knitr(v.1.48), htmlwidgets(v.1.6.4), xml2(v.1.3.6), withr(v.3.0.1), foreign(v.0.8-86), purrr(v.1.0.2), nnet(v.7.3-19), fansi(v.1.0.6), jomo(v.2.7-6), colorspace(v.2.1-1), future(v.1.34.0), ggplot2(v.3.5.1), gtools(v.3.9.5), globals(v.0.16.3), scales(v.1.3.0), iterators(v.1.0.14), MASS(v.7.3-60.2), cli(v.3.6.3), rmarkdown(v.2.28), crayon(v.1.5.3), generics(v.0.1.3), rstudioapi(v.0.16.0), sessioninfo(v.1.2.2), commonmark(v.1.9.1), minqa(v.1.2.8), DBI(v.1.2.3), pander(v.0.6.5), stringr(v.1.5.1), splines(v.4.4.0), assertthat(v.0.2.1), parallel(v.4.4.0), base64enc(v.0.1-3), mitools(v.2.4), vctrs(v.0.6.5), WeightIt(v.1.3.0), boot(v.1.3-30), glmnet(v.4.1-8), jsonlite(v.1.8.8), mitml(v.0.4-5), Formula(v.1.2-5), htmlTable(v.2.4.3), listenv(v.0.9.1), weights(v.1.0.4), locfit(v.1.5-9.10), foreach(v.1.5.2), tidyr(v.1.3.1), glue(v.1.7.0), parallelly(v.1.38.0), nloptr(v.2.1.1), pan(v.1.9), chk(v.0.9.2), codetools(v.0.2-20), stringi(v.1.8.4), shape(v.1.4.6.1), gtable(v.0.3.5), lme4(v.1.1-35.5), munsell(v.0.5.1), tibble(v.3.2.1), anesrake(v.0.80), pillar(v.1.9.0), furrr(v.0.3.1), htmltools(v.0.5.8.1), R6(v.2.5.1), rprojroot(v.2.0.4), evaluate(v.0.24.0), lattice(v.0.22-6), markdown(v.1.13), cards(v.0.2.1), backports(v.1.5.0), tictoc(v.1.2.1), broom(v.1.0.6), renv(v.1.0.7), simsurv(v.1.0.0), Rcpp(v.1.0.13), checkmate(v.2.3.2), gridExtra(v.2.3), nlme(v.3.1-164), xfun(v.0.47) and pkgconfig(v.2.0.3)

Show the code
pander::pander(options('repos'))
  • repos:

    REPO_NAME
    https://packagemanager.posit.co/cran/linux/noble/latest

4.7